
feat: add metric pgrst_jwt_cache_size in admin server #3802

Draft · wants to merge 1 commit into main from metric/jwt-cache-size
Conversation

@taimoorzaeem (Collaborator) commented Nov 26, 2024

Add the metric pgrst_jwt_cache_size to the admin server, which shows the cache size in bytes.

@steve-chavez (Member)

Don't forget the feedback about the actual size in bytes #3801 (comment)

Ultimately, I believe we're going to need an LRU cache for the JWT cache to be production ready. So having this metric in bytes will be useful now and later.

@taimoorzaeem (Collaborator, Author)

Don't forget the feedback about the actual size in bytes #3801 (comment)

Calculating "actual" byte size of cache is sort of tricky (haskell laziness is a lot to deal with sometimes) and I still haven't figured it out YET. In the meantime, I have written some code to approximate the cache size in bytes.

It works as follows:

Data.Cache gives a function toList which returns the cache entries as a list of tuples ([(ByteString, AuthResult, Maybe TimeSpec)] in our case).

Now, we can use the ghc-datasize library to calculate the byte size of the ByteString and AuthResult, but not the Maybe TimeSpec (recursiveSizeNF only works on types that are an instance of the NFData typeclass), hence I am calling it an "approximation".
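
A minimal sketch of that approximation (assuming PostgREST's AuthResult type has the NFData instance implied above; the helper name is illustrative, not code from the PR):

import           Control.Monad (forM)
import           Data.ByteString (ByteString)
import qualified Data.Cache as C
import           GHC.DataSize (recursiveSizeNF)

-- Sum the sizes of the key and value of every cache entry; the
-- Maybe TimeSpec field is skipped because TimeSpec has no NFData instance.
approxJwtCacheSizeInBytes :: C.Cache ByteString AuthResult -> IO Word
approxJwtCacheSizeInBytes cache = do
  entries <- C.toList cache
  sizes   <- forM entries $ \(token, result, _timeSpec) -> do
    tokenSize  <- recursiveSizeNF token
    resultSize <- recursiveSizeNF result
    pure (tokenSize + resultSize)
  pure (sum sizes)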

@taimoorzaeem force-pushed the metric/jwt-cache-size branch 2 times, most recently from 1254ed5 to 11727ab on December 17, 2024 06:50
@steve-chavez (Member)

This is pretty cool, so I do see the size starting at 0 then increasing as I do requests:

# $ nix-shell
# $ PGRST_ADMIN_SERVER_PORT=3001  PGRST_JWT_CACHE_MAX_LIFETIME=30000 postgrest-with-postgresql-16  -f test/spec/fixtures/load.sql postgrest-run

$ curl localhost:3001/metrics
# HELP pgrst_jwt_cache_size The number of cached JWTs
# TYPE pgrst_jwt_cache_size gauge
pgrst_jwt_cache_size 0.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 72.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 144.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 216.0

Of course this doesn't drop down after a while because we need #3801 for that.

One issue that I've noticed is that we're printing empty log lines for each request:

[nix-shell:~/Projects/postgrest]$ PGRST_ADMIN_SERVER_PORT=3001  PGRST_JWT_CACHE_MAX_LIFETIME=30000 postgrest-with-postgresql-16  -f test/spec/fixtures/load.sql postgrest-run
...
18/Jan/2025:18:16:32 -0500: Schema cache loaded in 17.1 milliseconds
18/Jan/2025:18:16:34 -0500: 
18/Jan/2025:18:16:38 -0500: 
18/Jan/2025:18:16:42 -0500: 

This is due to the addition of the new Observation and the following line here:

This is surprising behavior; I'll try to refactor this in a new PR.

For now, how about printing a message like:

JWTCache sz-> "The JWT Cache size increased to " <> sz <> "bytes"

This should only happen for a log level of LogDebug or higher; check how this is done in:

observationLogger :: LoggerState -> LogLevel -> ObservationHandler
observationLogger loggerState logLevel obs = case obs of
  o@(PoolAcqTimeoutObs _) -> do
    when (logLevel >= LogError) $ do
      logWithDebounce loggerState $
        logWithZTime loggerState $ observationMessage o
  o@(QueryErrorCodeHighObs _) -> do
    when (logLevel >= LogError) $ do
      logWithZTime loggerState $ observationMessage o
  o@(HasqlPoolObs _) -> do
    when (logLevel >= LogDebug) $ do
      logWithZTime loggerState $ observationMessage o
  PoolRequest ->
    pure ()
  PoolRequestFullfilled ->
    pure ()
  o ->
    logWithZTime loggerState $ observationMessage o
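
A sketch of how the suggested case could slot into that handler (the JWTCache constructor follows the suggestion above; this is not code from the PR):

  o@(JWTCache _) -> do
    when (logLevel >= LogDebug) $ do
      logWithZTime loggerState $ observationMessage o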

@steve-chavez (Member)

Calculating the "actual" byte size of the cache is sort of tricky (Haskell laziness is a lot to deal with sometimes) and I still haven't figured it out YET. In the meantime, I have written some code to approximate the cache size in bytes.

The above is understandable. What we really need is a good-enough approximation, so we have an order-of-magnitude understanding of whether the cache size is in KB, MB, or GB. So far this seems enough. We should definitely document why we do an approximation, though.

src/PostgREST/App.hs (outdated review thread, resolved)
@steve-chavez (Member)

I've just noticed that perf badly drops down on this PR:

param        v12.2.3   head   main
throughput   448       121    399

https://github.com/PostgREST/postgrest/pull/3802/checks?check_run_id=34517395528

The documentation for recursiveSize says:

This function works very quickly on small data structures, but can be slow on large and complex ones. If speed is an issue it's probably possible to get the exact size of a small portion of the data structure and then estimate the total size from that.

@taimoorzaeem What could we do to avoid this drop? Maybe calculate the cache size periodically on a background thread? Any thoughts?
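
As a sketch, the background-thread idea could look something like this (the interval and helper names are placeholders, not a decided design):

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)

-- Measure the cache size off the hot path and report it to the metric,
-- so per-request latency is unaffected by the (possibly slow) traversal.
startSizeSampler :: IO Word -> (Word -> IO ()) -> IO ()
startSizeSampler measure report = void . forkIO . forever $ do
  measure >>= report
  threadDelay (30 * 1000000)  -- sample every 30 seconds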

src/PostgREST/Auth.hs (two outdated review threads, resolved)
@taimoorzaeem force-pushed the metric/jwt-cache-size branch from 11727ab to 145c236 on January 19, 2025 14:41
@taimoorzaeem (Collaborator, Author)

Calculating the "actual" byte size of the cache is sort of tricky (Haskell laziness is a lot to deal with sometimes) and I still haven't figured it out YET. In the meantime, I have written some code to approximate the cache size in bytes.

The above is understandable. What we really need is a good-enough approximation, so we have an order-of-magnitude understanding of whether the cache size is in KB, MB, or GB. So far this seems enough. We should definitely document why we do an approximation, though.

Now that I have gotten better at Haskell 🚀, I solved the issue and we don't need an "approximation". We CAN calculate the full cache size in bytes.

@taimoorzaeem force-pushed the metric/jwt-cache-size branch 3 times, most recently from 4b1afd1 to d9de43a on January 19, 2025 18:11
@taimoorzaeem (Collaborator, Author) commented Jan 19, 2025

The load test and memory test are failing with:

src/PostgREST/Auth.hs:44:1: error:
    Could not find module ‘GHC.DataSize’
    Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?
    Use -v (or `:set -v` in ghci) to see a list of the files searched for.
   |
44 | import GHC.DataSize            (recursiveSizeNF)

Is there any additional configuration that needs to be added for profiling?

Comment on lines 53 to 55 (Member):

# nixpkgs have ghc-datasize-0.2.7 marked as broken
ghc-datasize = lib.markUnbroken prev.ghc-datasize;

Suggested change:

-# nixpkgs have ghc-datasize-0.2.7 marked as broken
-ghc-datasize = lib.markUnbroken prev.ghc-datasize;
+# TODO: Remove this once https://github.com/NixOS/nixpkgs/pull/375121
+# has made it to us.
+ghc-datasize = lib.markUnbroken prev.ghc-datasize;

@wolfgangwalther (Member)

Load test and memory test failing with:

Does it happen locally, too?

I am confused; I can't exactly spot what's wrong right now.

@taimoorzaeem (Collaborator, Author)

Does it happen locally, too?

I am confused; I can't exactly spot what's wrong right now.

Yes, it does happen locally.

[nix-shell]$ postgrest-loadtest
...
src/PostgREST/Auth.hs:44:1: error:
    Could not find module ‘GHC.DataSize’
    Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?
    Use -v (or `:set -v` in ghci) to see a list of the files searched for.
   |
44 | import GHC.DataSize            (recursiveSizeNF)
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@taimoorzaeem force-pushed the metric/jwt-cache-size branch 2 times, most recently from 7ff02db to 6eb7d3c on January 20, 2025 17:19
@taimoorzaeem (Collaborator, Author)

I need some way to run postgrest-loadtest on the CI. Currently it is failing because of building with Nix. Running PGRST_BUILD_CABAL=1 postgrest-loadtest works locally, but I am not sure how to set this up on CI temporarily to check the loadtest results. Is running the loadtest on CI equivalent to running it locally? Would I get the same results?

@wolfgangwalther (Member) commented Jan 20, 2025

Yeah, running the loadtest in CI does not work when dependencies are changed, because it needs to run against the base branch, which doesn't have the dependencies yet... and then cabal just breaks it somehow.

You can run something like this locally to get the same markdown output:

postgrest-loadtest-against main v12.2.5
postgrest-loadtest-report

(but it's likely that it fails the same way... :D)

@wolfgangwalther (Member)

Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?

The thing I don't understand about this error message is that it appears in contexts where we don't use profiling libs. We do for the memory test, so there the error message makes "kind of sense" (I still don't understand why it happens, though). But the loadtest and the regular build on macOS... those don't use profiling.

Hm... actually, we don't do a regular dynamically linked Linux build via Nix, I think. So in fact it fails for every Nix build except the static build. Still don't know what's happening, though.

@steve-chavez (Member)

[nix-shell]$ postgrest-loadtest
...
src/PostgREST/Auth.hs:44:1: error:

I get a similar error when trying this locally too.

Yeah, running the loadtest in CI does not work when dependencies are changed, because it needs to run against the base branch, which doesn't have the dependencies yet... and then cabal just breaks it somehow.

To unblock the PR, how about only adding the dependency in another PR and then merging it? Then I assume this PR would run the loadtest?

@wolfgangwalther (Member)

To unblock the PR, how about only adding the dependency in another PR and then merging it? Then I assume this PR would run the loadtest?

No, I don't think so. It seems the dependency issue was fixed a while ago in 0c5d2e5. Also, the fact that all Nix builds fail (the memory test, on Darwin, etc.) indicates that there is something else going on.

@taimoorzaeem marked this pull request as ready for review on January 27, 2025 06:48
@taimoorzaeem marked this pull request as draft on January 27, 2025 16:52
@taimoorzaeem force-pushed the metric/jwt-cache-size branch 2 times, most recently from b5331af to de80a9e on January 27, 2025 17:32
@taimoorzaeem marked this pull request as ready for review on January 27, 2025 17:34
@taimoorzaeem force-pushed the metric/jwt-cache-size branch from de80a9e to 7034d38 on January 27, 2025 18:00
@steve-chavez (Member) left a comment

Great work 🚀

The codecov failures look related to Internal.hs, which should go once #3881 is solved.

src/PostgREST/Auth.hs (outdated review thread, resolved)
Comment on lines 186 to 192:

   -- if token not found, add to cache and increment cache size metric
   case (authResult, checkCache) of
-    (Right res, Nothing) -> C.insert' (getJwtCache appState) (getTimeSpec res maxLifetime utc) token res
+    (Right res, Nothing) -> do
+      let tSpec = getTimeSpec res maxLifetime utc
+      C.insert' jwtCache (Just tSpec) token res
+      entrySize <- calcCacheEntrySizeInBytes (token, res, tSpec)
+      observer $ JWTCache entrySize -- adds size to metric
@steve-chavez (Member) commented Jan 27, 2025

I still don't understand how the incremental calculation is done. Where is the JWT cache size stored and then incremented? My thinking is that the cache size should be stored somewhere in AppState, but there's nothing added there.

@taimoorzaeem (Collaborator, Author)

Wait, I think this PR isn't quite ready yet. I am now thinking of a better design. I'll take another look tomorrow.

@taimoorzaeem marked this pull request as draft on January 27, 2025 21:53
@taimoorzaeem (Collaborator, Author) commented Jan 28, 2025

@steve-chavez I think we need a design decision here. We are currently allowing our in-memory JWT cache to grow without bound. I think this is bad because it will cause memory exhaustion and impact performance. We also don't need the explicit expiration support that Data.Cache gives us, because we already have the exp claim for that.

Switch to LRU Cache

As you said earlier, we need an LRU cache for caching to be production-ready, so how about we switch our cache implementation to https://hackage.haskell.org/package/lrucache. This would also mean that we need a potentially configurable maximum cache size (add a new config?). WDYT?
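
For illustration, a minimal sketch of the switch using Data.Cache.LRU.IO from lrucache (the entry bound is a placeholder for the proposed config, and AuthResult is the existing PostgREST type):

import           Data.ByteString (ByteString)
import qualified Data.Cache.LRU.IO as LRU

-- A bounded, thread-safe cache: once the maximum entry count is reached,
-- inserting a new JWT evicts the least-recently-used one.
newJwtCache :: IO (LRU.AtomicLRU ByteString AuthResult)
newJwtCache = LRU.newAtomicLRU (Just 2000)  -- placeholder bound; would come from the new config

cacheJwt :: LRU.AtomicLRU ByteString AuthResult -> ByteString -> AuthResult -> IO ()
cacheJwt cache token res = LRU.insert token res cache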

@steve-chavez (Member)

@taimoorzaeem Yes, that is the end goal. However, wouldn't that imply a breaking change? It would require a new major version.

Since the current JWT cache is broken (#3788), we should release a new minor with the fix and then do the LRU cache for the new major.

@wolfgangwalther (Member)

we should release a new minor with the fix

We can't release a new minor version. We can only release a new patch version on the v12 branch or a new major on the main branch. We still have a few breaking changes that we've been carrying for quite a while on main (removal of PostgreSQL), and we have never made a release for v13.

@steve-chavez (Member)

How should we handle this in v12 then? Maybe put a "danger" note on https://postgrest.org/en/v12/references/auth.html#jwt-caching, saying that the feature is broken and should only be used on v13?

@wolfgangwalther (Member)

How should we handle this in v12 then? Maybe put a "danger" note on https://postgrest.org/en/v12/references/auth.html#jwt-caching, saying that the feature is broken and should only be used on v13?

Can we not run purgeExpired periodically? Why do we need this feature for it?
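
For reference, a periodic purge could be as small as this sketch against the current Data.Cache API (the interval and the wiring are assumptions, not code from the PR):

import           Control.Concurrent (forkIO, threadDelay)
import           Control.Monad (forever, void)
import           Data.Hashable (Hashable)
import qualified Data.Cache as C

-- Evict expired entries on a timer instead of waiting for lookups;
-- the 60-second interval is an arbitrary placeholder.
startPurgeLoop :: (Eq k, Hashable k) => C.Cache k v -> IO ()
startPurgeLoop cache = void . forkIO . forever $ do
  threadDelay (60 * 1000000)
  C.purgeExpired cache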

@steve-chavez (Member) commented Jan 28, 2025

Can we not run purgeExpired periodically? Why do we need this feature for it?

There's no way to test that the fix is working correctly without this. Check https://github.com/PostgREST/postgrest/pull/3801/files#r1855025853. There, the feature was added in the same PR as the fix.

We can't release a new minor version. We can only release a new patch version on the v12 branch or a new major on the main branch. We still have a few breaking changes that we've been carrying for quite a while on main (removal of PostgreSQL), and we have never made a release for v13.

Why is the above the case, btw? We could just not pick the breaking changes for the minor.

@wolfgangwalther (Member)

Can we not run purgeExpired periodically?

There's no way to test that the fix is working correctly without this.

If we can use this feature on the main branch to test that a separate commit with the fix is working, we might just backport the fix without the test.

Why is the above the case, btw? We could just not pick the breaking changes for the minor.

Partly because that's the workflow we agreed on. We are backporting fixes, not features. But there's reason behind it:

  • Once you start backporting features, you are going to backport a lot more code. Does this feature apply cleanly on v12? Do we need to backport other features first? Why do we release a minor with some features, but not with others? The complexity just increases incredibly.
  • The risk increases. Every new feature increases the risk of introducing new bugs. We agreed to backport fixes to enable us to do two things at the same time: iterate quickly with new stuff on the main branch, but also stabilize the currently stable release. We were counteracting a pattern we had back then: after a new minor/major, we always had to quickly ship some bugfixes, but if our main branch had already advanced past the next patch release, we were not able to anymore.

I fully agree we need to fix the bug. But we should not introduce new risks by doing so.

@steve-chavez (Member)

Partly because that's the workflow we agreed on. We are backporting fixes, not features. But there's reason behind it:

Agree, thanks for the reminder.

If we can use this feature on the main branch to test that a separate commit with the fix is working, we might just backport the fix without the test.

Sounds good. I'll just cherry-pick this PR commit locally while reviewing #3801 then.

@taimoorzaeem I suggest keeping the LRU cache improvement after #3801 is merged. We should clear that first.

@taimoorzaeem (Collaborator, Author)

@taimoorzaeem I suggest keeping the LRU cache improvement after #3801 is merged. We should clear that first.

Sure, sounds good! 👍
